This report describes a within-axonal differential expression (DE) analysis of Axonseq data provided by Jimmy Beckers using the Bioconductor R package DESeq2. DESeq2 estimates the negative binomial distribution for each gene in the dataset to model RNA-seq count data, then conducts DE analysis to identify genes that are significantly differentially expressed between experimental conditions.
This report performs the analysis of the following conditions:
Comparison 1: C9-1 vs C9-1iso within axon For this comparison you select “Condition: C9-1”, “Treatment: None”, “Compartment: axon” versus “Condition: C9-1iso”, “Treatment: None”, “Compartment: axon”
Comparison 2: C9-2 vs C9-2iso within axon For this comparison you select “Condition: C9-2”, “Treatment: None”, “Compartment: axon” versus “Condition: C9-2iso”, “Treatment: None”, “Compartment: axon”
Comparison 3: C9 vs C9iso within axon For this comparison you select “Group: C9”, “ALS mutation: Yes”, “Treatment: None”, “Compartment: axon” versus “Group: C9”, “ALS mutation: No”, “Treatment: None”, “Compartment: axon”
Note: instead of “Group: C9”, “ALS mutation: Yes”, you could also select “Condition: C9-1 + C9-2”
Log fold change is calculated with reference to “iso” groups in DE comparisons
## load libraries
library("dplyr")
library("tidyr")
library("tibble")
library("ggplot2")
library("ggrepel")
library("viridis")
library("DESeq2")
library("RColorBrewer")
library("pheatmap")
library("readr")
library("readxl")
library("writexl")
library("janitor")
library("gridExtra")
library("biomaRt")
library("DT")
## set project variables
proj <- "Within_Axon"
sel_comp <- "axon"
ann_version <- 2
c1 <- "C9_1iso vs C9_1"
c2 <- "C9_2iso vs C9_2"
c3 <- "C9_iso vs C9"
We load the complete Axonseq count matrix of all samples and metadata describing the conditions of these samples. The metadata and matrix counts are subset to contain only Within Axon samples. The metadata (or “colData”) of the samples being compared is displayed in an HTML table below:
## load metadata
readRDS(file = paste0("./data/Jimmy_axonseq_project_metadata_20231129.rds")) %>%
dplyr::filter(!sample %in% c("P19509_2001", "P19509_2039", "P19509_2051"),
## no outliers
compartment == sel_comp) %>% ## only axon samples
dplyr::rename(genotype = group) %>%
dplyr::mutate(group = dplyr::case_when(
grepl("iso", condition) ~ "C9_iso",
grepl("_CTRL", condition) ~ "C9_ASO_CTRL",
grepl("_MO", condition) ~ "C9_ASO",
grepl("C9", condition) ~ "C9",
grepl("TDP43", condition) ~ "TDP43",
grepl("FUS", condition) ~ "FUS")) %>%
dplyr::filter(group %in% c("C9", "C9_iso")) %>% ## only non-ASO, non-TDP, non-FUS samples
dplyr::mutate_all(as.factor) %>%
as.data.frame() -> meta_data
rownames(meta_data) <- meta_data$sample
## load count data
read.table(file = "./data/counts_JB_AXONseq.csv", sep = ";", header = TRUE) %>%
dplyr::rename(genes = names(.)[1]) %>%
dplyr::select(genes, meta_data$sample) %>% ## contains only samples in metadata
tibble::column_to_rownames(var = "genes") %>%
as.matrix() -> cts
## display metadata table
meta_data %>% DT::datatable()
What was not selected: * somal samples * samples
P19509_2001, P19509_2039 and
P19509_2051 were removed because they appear to be outliers
* ASO, TDP and FUS samples
## by condition
dds <- DESeqDataSetFromMatrix(countData = cts, ## the raw gene counts
colData = meta_data, ## our sample metadata
design = ~ condition) ## our design
dds2 <- DESeqDataSetFromMatrix(countData = cts, ## the raw gene counts
colData = meta_data, ## our sample metadata
design = ~ group) ## our design
cat("Number of genes in dds object *before* filtering out rows with zero counts \n")
#> Number of genes in dds object *before* filtering out rows with zero counts
dim(assay(dds))[1]
#> [1] 57131
For QC reasons, genes with counts of all 0 for all samples (i.e. rows
in the count matrix with no count measurements) are removed before
running DESeq. This is because for rows with all zero
counts no variance can be modeled.
keep <- rowSums(counts(dds)) > 0
dds <- DESeq(dds[keep, ])
keep2 <- rowSums(counts(dds2)) > 0
dds2 <- DESeq(dds2[keep2, ])
cat("Number of genes in dds object *after* filtering out rows with zero counts \n")
#> Number of genes in dds object *after* filtering out rows with zero counts
dim(assay(dds))[1]
#> [1] 32040
Results tables are generated using the function results,
which extracts a results table with log2 fold changes, p-values and
adjusted p-values. With no additional arguments to results, the log2
fold change and Wald test p-value will be for the first
variable in the design formula, the experiment group will be the
last variable.
res1 <- results(dds, contrast = c("condition", "C9_1iso", "C9_1"), alpha = 0.05)
res2 <- results(dds, contrast = c("condition", "C9_2iso", "C9_2"), alpha = 0.05)
res3 <- results(dds2, contrast = c("group", "C9_iso", "C9"), alpha = 0.05)
PCA is a method of visually identifying the similarity or difference
between samples. PCA rotates the data cloud onto an orthogonal basis
determined by the dimensions of maximal variance. The first two
Principal Components (PCs) usually hold the majority of the variance of
the data. The following plots show the variance stabilized transformed
count matrix samples projected onto the two largest Principal Components
(i.e. PC1 and PC2). DESeq2 recommends two types of PCA
stabilizations applied to the PCA: * vst
“variance stabilizing transformation” * rld “regularized
log transformation”
## perform variance stabilizing transformation and PCA and plot with ggplot2
rld <- DESeq2::vst(dds)
pcaData <- plotPCA(rld, intgroup = c("condition", "compartment"),
returnData = TRUE)
percentVar <- round(100 * attr(pcaData, "percentVar"))
pcaData %>%
ggplot(aes(PC1, PC2, color = condition, shape = compartment)) +
geom_point(size = 3) +
geom_text_repel(aes(label = colnames(rld)), force = 5,
arrow = arrow(length = unit(0.03, "npc"),
type = "closed", ends = "first")) +
xlab(paste0("PC1: ",percentVar[1],"% variance")) +
ylab(paste0("PC2: ",percentVar[2],"% variance")) +
theme(axis.ticks.x = element_blank(), axis.text.x = element_blank(),
axis.ticks.y = element_blank(), axis.text.y = element_blank()) +
coord_fixed() +
ggtitle(paste0("RLD PCA ", gsub("_", " ", sel_comp)))
Note: we observe that the rld and vst PCA
stabilization transformations yield very different projections.
## perform regularized log transformation and PCA and plot with ggplot2
rld <- rlogTransformation(dds)
pcaData <- plotPCA(rld, intgroup = c("condition", "compartment"),
returnData = TRUE)
percentVar <- round(100 * attr(pcaData, "percentVar"))
pcaData %>%
ggplot(aes(PC1, PC2, color = condition, shape = compartment)) +
geom_point(size = 3) +
geom_text_repel(aes(label = colnames(rld)), force = 5,
arrow = arrow(length = unit(0.03, "npc"),
type = "closed", ends = "first")) +
xlab(paste0("PC1: ",percentVar[1],"% variance")) +
ylab(paste0("PC2: ",percentVar[2],"% variance")) +
theme(axis.ticks.x = element_blank(), axis.text.x = element_blank(),
axis.ticks.y = element_blank(), axis.text.y = element_blank()) +
coord_fixed() +
ggtitle(paste0("VST PCA ", gsub("_", " ", sel_comp)))
Size factors are a method of normalizing used by the DESeq function to normalize the data in terms of sequencing depth. Size factor is the median ratio of the sample over a pseudosample: for each gene, the geometric mean of all samples. Size factors account for differences in sequencing depth are typically centered around 1 (indicating comparable sequencing depth).
dds$sizeFactor %>%
as.data.frame() %>%
tibble::rownames_to_column(var = "sample") %>%
dplyr::rename(size_factors = names(.)[2]) %>%
dplyr::left_join(meta_data, by = "sample") %>%
ggplot(aes(x = sample, y = size_factors, fill = condition)) +
geom_bar(stat = "identity") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1, size = 4)) +
geom_hline(yintercept = mean(dds$sizeFactor), color = "red", linetype = "dashed")
DESeq2 estimates gene dispersion using an algorithm that first generates gene-wise maximum likelihood estimates (MLEs) that are obtained using only the respective gene’s data (black dots). Then, a curve (red) is fit to the MLEs to capture the overall trend of dispersion-mean dependence. This fit is used as a prior mean for a second estimation round, which results in the final maximum a priori (MAP) estimates of dispersion. This results in a “shrinkage” of the noisy gene-wise estimates toward the consensus represented by the red line. The black points circled in blue are detected as dispersion outliers and not shrunk toward the prior (shrinkage would follow the dotted line). A more in-depth theoretical explanation of the DESeq2 algorithm can be found here
plotDispEsts(dds, main = "DESeq2 Dispersion plot")
DESeq2 performs independent filtering by default using the mean of normalized counts as a filter statistic. A threshold on the filter statistic (first value) is found which optimizes the number of adjusted p-values lower than significance level alpha. The adjusted p-values for the genes which do not pass the filter threshold are set to NA. The results also returns the mean of normalized counts (second value).
cat(paste0("Filter threshold value and mean of normalized counts ", c1, "\n"))
#> Filter threshold value and mean of normalized counts C9_1iso vs C9_1
metadata(res1)$filterThreshold
#> 60.10204%
#> 1.959663
cat(paste0("Filter threshold value and mean of normalized counts ", c2, "\n"))
#> Filter threshold value and mean of normalized counts C9_2iso vs C9_2
metadata(res2)$filterThreshold
#> 0%
#> 0.0204094
cat(paste0("Filter threshold value and mean of normalized counts ", c3, "\n"))
#> Filter threshold value and mean of normalized counts C9_iso vs C9
metadata(res3)$filterThreshold
#> 68.3806%
#> 2.250812
The filterThreshold returns the threshold chosen (vertical line in the plots below) by the DESeq2 analysis of the lowest quantile of the filter for which the number of sample rejections is within 1 residual standard deviation to the peak of a curve fit to the number of rejections over the filter quantiles. The following diagnostic plot shows the number of rejected samples (y-axis) plotted against quantiles of filter (x-axis).
par(mfrow = c(1, 1))
plot(metadata(res1)$filterNumRej, type = "b", main = c1,
xlab = "Quantiles of filter", ylab = "Number of rejections")
lines(metadata(res1)$lo.fit, col = "red")
abline(v = metadata(res1)$filterTheta)
par(mfrow = c(1, 1))
plot(metadata(res2)$filterNumRej, type = "b", main = c2,
xlab = "Quantiles of filter", ylab = "Number of rejections")
lines(metadata(res2)$lo.fit, col = "red")
abline(v = metadata(res2)$filterTheta)
par(mfrow = c(1, 1))
plot(metadata(res3)$filterNumRej, type = "b", main = c3,
xlab = "Quantiles of filter", ylab = "Number of rejections")
lines(metadata(res3)$lo.fit, col = "red")
abline(v = metadata(res3)$filterTheta)
The following plot shows the number of frequency of counts (y-axis) against p-values between 0 and 1 (x-axis).
par(mfrow = c(1, 1))
hist(res1$pvalue, col = "lavender", xlab = "p-values", main = c1)
par(mfrow = c(1, 1))
hist(res2$pvalue, col = "lavender", xlab = "p-values", main = c2)
par(mfrow = c(1, 1))
hist(res3$pvalue, col = "lavender", xlab = "p-values", main = c3)
print(x = paste0(c1, " ", gsub("_", " ", sel_comp)))
#> [1] "C9_1iso vs C9_1 axon"
print(summary(res1))
#>
#> out of 32040 with nonzero total read count
#> adjusted p-value < 0.05
#> LFC > 0 (up) : 30, 0.094%
#> LFC < 0 (down) : 0, 0%
#> outliers [1] : 2130, 6.6%
#> low counts [2] : 18315, 57%
#> (mean count < 2)
#> [1] see 'cooksCutoff' argument of ?results
#> [2] see 'independentFiltering' argument of ?results
#>
#> NULL
print(x = paste0(c2, " ", gsub("_", " ", sel_comp)))
#> [1] "C9_2iso vs C9_2 axon"
print(summary(res2))
#>
#> out of 32040 with nonzero total read count
#> adjusted p-value < 0.05
#> LFC > 0 (up) : 5, 0.016%
#> LFC < 0 (down) : 7, 0.022%
#> outliers [1] : 2130, 6.6%
#> low counts [2] : 0, 0%
#> (mean count < 0)
#> [1] see 'cooksCutoff' argument of ?results
#> [2] see 'independentFiltering' argument of ?results
#>
#> NULL
print(x = paste0(c3, " ", gsub("_", " ", sel_comp)))
#> [1] "C9_iso vs C9 axon"
print(summary(res3))
#>
#> out of 31453 with nonzero total read count
#> adjusted p-value < 0.05
#> LFC > 0 (up) : 121, 0.38%
#> LFC < 0 (down) : 0, 0%
#> outliers [1] : 0, 0%
#> low counts [2] : 21909, 70%
#> (mean count < 2)
#> [1] see 'cooksCutoff' argument of ?results
#> [2] see 'independentFiltering' argument of ?results
#>
#> NULL
We annotate the DE results with human Ensembl and
Entrez IDs by accessing the BioMart
database. The saved .RDS object of the biomaRt
query results are loaded.
## annotate DE results with Biomart database
# listAttributes(mart = useDataset("mmusculus_gene_ensembl", useMart("ensembl")))
# getBM(attributes = c("ensembl_gene_id", "entrezgene_id", "external_gene_name"),
# mart = useDataset("hsapiens_gene_ensembl", useMart("ensembl"))) %>%
# dplyr::rename(ensembl_id = ensembl_gene_id,
# entrez_id = entrezgene_id,
# genes = external_gene_name) %>%
# dplyr::filter(stringr::str_length(genes) > 1,
# !duplicated(ensembl_id)) -> human_biomart
# saveRDS(object = human_biomart, file = "./data/human_biomart.rds")
human_biomart <- readRDS(file = "./data/human_biomart.rds")
The lists of up and downregulated genes in C9orf72 MNs (bulk-RNA-seq) that from [Selvaraj, B.T, Livesey, M. R et al., 2018][https://pubmed.ncbi.nlm.nih.gov/29367641/] will be used to compare enriched DE genes in our data.
# Selvaraj, B.T, Livesey, M. R et al.,
# Supplementary data -1
readxl::read_excel("~/Documents/tmp/dacruz/202404_jimmy/data/41467_2017_2729_MOESM3_ESM.xlsx",
skip = 3) %>%
dplyr::rename(genes = gene_name) %>%
dplyr::mutate(C9_upreg_marker = "up") %>%
as.data.frame() -> df_up
names(df_up) <- gsub("-", "_", names(df_up))
df_up %>% DT::datatable()
# Selvaraj, B.T, Livesey, M. R et al.,
# Supplementary data: 2
readxl::read_excel("~/Documents/tmp/dacruz/202404_jimmy/data/41467_2017_2729_MOESM4_ESM.xlsx",
skip = 3) %>%
dplyr::rename(genes = gene_name) %>%
dplyr::mutate(C9_downreg_marker = "down") %>%
as.data.frame() -> df_down
names(df_down) <- gsub("-", "_", names(df_down))
df_down %>% DT::datatable()
The annotated DE results are arranged by
p-adjusted value. Normalized counts from the are added to
the right of the DE statistics.
## RES1
merge(as.data.frame(res1), as.data.frame(counts(dds, normalized = TRUE)),
by = "row.names", sort = FALSE) %>%
dplyr::rename(genes = names(.)[1]) %>%
dplyr::left_join(human_biomart, by = "genes") %>%
dplyr::mutate(
non_zero_de = rowSums(dplyr::select(., contains("P19")) != 0),
perc_non_zero_de = non_zero_de / nrow(meta_data)) %>%
dplyr::filter(!duplicated(genes)) %>%
dplyr::left_join(df_up %>% dplyr::select(c(genes, contains("marker"))), by = "genes") %>%
dplyr::left_join(df_down %>% dplyr::select(c(genes, contains("marker"))), by = "genes") %>%
dplyr::select(genes, contains("_id"), contains("_marker"), contains("non_zero"),
everything()) %>%
dplyr::arrange(padj) %>%
as.data.frame() -> res_df1
## readjust sig values
# res_df1$pvalue[res_df1$pvalue == 0] <- 1e-300
# res_df1$padj[res_df1$padj == 0] <- 1e-300
## RES2
merge(as.data.frame(res2), as.data.frame(counts(dds, normalized = TRUE)),
by = "row.names", sort = FALSE) %>%
dplyr::rename(genes = names(.)[1]) %>%
dplyr::left_join(human_biomart, by = "genes") %>%
dplyr::mutate(
non_zero_de = rowSums(dplyr::select(., contains("P19")) != 0),
perc_non_zero_de = non_zero_de / nrow(meta_data)) %>%
dplyr::filter(!duplicated(genes)) %>%
dplyr::left_join(df_up %>% dplyr::select(c(genes, contains("marker"))), by = "genes") %>%
dplyr::left_join(df_down %>% dplyr::select(c(genes, contains("marker"))), by = "genes") %>%
dplyr::select(genes, contains("_id"), contains("_marker"), contains("non_zero"),
everything()) %>%
dplyr::arrange(padj) %>%
as.data.frame() -> res_df2
## readjust sig values
# res_df2$pvalue[res_df2$pvalue == 0] <- 1e-300
# res_df2$padj[res_df2$padj == 0] <- 1e-300
## RES3
merge(as.data.frame(res3), as.data.frame(counts(dds, normalized = TRUE)),
by = "row.names", sort = FALSE) %>%
dplyr::rename(genes = names(.)[1]) %>%
dplyr::left_join(human_biomart, by = "genes") %>%
dplyr::mutate(
non_zero_de = rowSums(dplyr::select(., contains("P19")) != 0),
perc_non_zero_de = non_zero_de / nrow(meta_data)) %>%
dplyr::filter(!duplicated(genes)) %>%
dplyr::left_join(df_up %>% dplyr::select(c(genes, contains("marker"))), by = "genes") %>%
dplyr::left_join(df_down %>% dplyr::select(c(genes, contains("marker"))), by = "genes") %>%
dplyr::select(genes, contains("_id"), contains("_marker"), contains("non_zero"),
everything()) %>%
dplyr::arrange(padj) %>%
as.data.frame() -> res_df3
## readjust sig values
# res_df2$pvalue[res_df2$pvalue == 0] <- 1e-300
# res_df2$padj[res_df2$padj == 0] <- 1e-300
A MA plot illustrates log-fold expression change between two groups of samples, created by transforming and the data onto two scales: M (the log of the ratio of level counts for each gene between two samples) and A (the average level counts for each gene across the two samples) scales. MA plots demonstrates the difference between samples in terms of signal intensities of read counts. In this type of plot, genes with similar expression levels in two samples will appear around the horizontal line y = 0 (red line). The following MA plot illustrates log-fold expression change for each comparison after the DESeq2 analysis. Significant genes (P < 0.05) are highlighted in red.
results1 <- as.data.frame(res_df1)
results1[is.na(results1)] <- 0.99 # change NA results to 0.99 for correct MAplot
results1 %>%
ggplot(aes(x = baseMean, y = log2FoldChange)) +
geom_point(aes(colour = padj < 0.05), size = 0.5) +
scale_colour_manual(name = 'padj < 0.05',
values = setNames(c('red','black'), c(TRUE, FALSE))) +
scale_x_continuous(trans = "log10", limits = c(0.1, 300000)) +
geom_smooth(colour = "red") +
geom_abline(slope = 0, intercept = 0, colour = "blue") +
theme(plot.title = element_text(hjust = 0.5)) +
xlab("baseMean (A)") + ylab("log2FoldChange (M)") +
ggtitle(paste0(c1, " in ", gsub("_", " ", sel_comp)))
results2 <- as.data.frame(res_df2)
results2[is.na(results2)] <- 0.99 # change NA results to 0.99 for correct MAplot
results2 %>%
ggplot(aes(x = baseMean, y = log2FoldChange)) +
geom_point(aes(colour = padj < 0.05), size = 0.5) +
scale_colour_manual(name = 'padj < 0.05',
values = setNames(c('red','black'), c(TRUE, FALSE))) +
scale_x_continuous(trans = "log10", limits = c(0.1, 300000)) +
geom_smooth(colour = "red") +
geom_abline(slope = 0, intercept = 0, colour = "blue") +
theme(plot.title = element_text(hjust = 0.5)) +
xlab("baseMean (A)") + ylab("log2FoldChange (M)") +
ggtitle(paste0(c2, " in ", gsub("_", " ", sel_comp)))
results3 <- as.data.frame(res_df3)
results3[is.na(results3)] <- 0.99 # change NA results to 0.99 for correct MAplot
results3 %>%
ggplot(aes(x = baseMean, y = log2FoldChange)) +
geom_point(aes(colour = padj < 0.05), size = 0.5) +
scale_colour_manual(name = 'padj < 0.05',
values = setNames(c('red','black'), c(TRUE, FALSE))) +
scale_x_continuous(trans = "log10", limits = c(0.1, 300000)) +
geom_smooth(colour = "red") +
geom_abline(slope = 0, intercept = 0, colour = "blue") +
theme(plot.title = element_text(hjust = 0.5)) +
xlab("baseMean (A)") + ylab("log2FoldChange (M)") +
ggtitle(paste0(c3, " in ", gsub("_", " ", sel_comp)))
A volcano plot is a type of scatter plot that is used to quickly identify genes that display large magnitude changes that are also statistically significant. A volcano plot is constructed by plotting the negative log of the p-value on the y-axis. This results in data points with low p-values appearing toward the top of the plot. The x-axis is the log of the fold change between the two experimental conditions. The log of the fold change is used so that changes in both directions appear equidistant from the center. Plotting points in this way results in two regions of interest in the plot: those points that are found toward the top of the plot that are far to either the left- or right-hand sides. These represent values that display large magnitude fold changes (on the left or right of center) as well as high statistical significance (toward the top). The following Volcano plot shows log of the fold change and negative log of the p-values for each comparison. Significant genes (P < 0.05) with log2 fold change (> 1) are highlighted in red.
res_df1 %>%
# dplyr::filter(!is.na(pvalue),
# perc_non_zero_de > 0.66) %>%
dplyr::mutate(threshold = as.factor(abs(log2FoldChange) > 1 & pvalue < 0.05),
sig_group = as.factor(dplyr::case_when(
log2FoldChange > 1 & pvalue < 0.05 ~ !!gsub("vs.*", "", c1),
log2FoldChange < -1 & pvalue < 0.05 ~ !!gsub(".*vs", "", c1),
TRUE ~ "not significant"))) %>%
ggplot(aes(x = log2FoldChange, y = -log10(pvalue), color = sig_group)) +
geom_point(alpha = 0.75, size = 0.75) +
geom_hline(yintercept = -log10(0.05), color = "red", linetype = "dashed") +
geom_vline(xintercept = -1, color = "red", linetype = "dashed") +
geom_vline(xintercept = 1, color = "red", linetype = "dashed") +
xlab("log2 fold change") +
ylab("-log10 p-value") +
theme(plot.title = element_text(hjust = 0.5)) +
scale_color_manual(values = c("magenta", "limegreen", "gray3")) +
ggtitle(paste0(c1, " in ", gsub("_", " ", sel_comp)))
res_df2 %>%
# dplyr::filter(!is.na(pvalue),
# perc_non_zero_de > 0.66) %>%
dplyr::mutate(threshold = as.factor(abs(log2FoldChange) > 1 & pvalue < 0.05),
sig_group = as.factor(dplyr::case_when(
log2FoldChange > 1 & pvalue < 0.05 ~ !!gsub("vs.*", "", c2),
log2FoldChange < -1 & pvalue < 0.05 ~ !!gsub(".*vs", "", c2),
TRUE ~ "not significant"))) %>%
ggplot(aes(x = log2FoldChange, y = -log10(pvalue), color = sig_group)) +
geom_point(alpha = 0.75, size = 0.75) +
geom_hline(yintercept = -log10(0.05), color = "red", linetype = "dashed") +
geom_vline(xintercept = -1, color = "red", linetype = "dashed") +
geom_vline(xintercept = 1, color = "red", linetype = "dashed") +
xlab("log2 fold change") +
ylab("-log10 p-value") +
theme(plot.title = element_text(hjust = 0.5)) +
scale_color_manual(values = c("magenta", "limegreen", "gray3")) +
ggtitle(paste0(c2, " in ", gsub("_", " ", sel_comp)))
res_df3 %>%
# dplyr::filter(!is.na(pvalue),
# perc_non_zero_de > 0.66) %>%
dplyr::mutate(threshold = as.factor(abs(log2FoldChange) > 1 & pvalue < 0.05),
sig_group = as.factor(dplyr::case_when(
log2FoldChange > 1 & pvalue < 0.05 ~ !!gsub("vs.*", "", c3),
log2FoldChange < -1 & pvalue < 0.05 ~ !!gsub(".*vs", "", c3),
TRUE ~ "not significant"))) %>%
ggplot(aes(x = log2FoldChange, y = -log10(pvalue), color = sig_group)) +
geom_point(alpha = 0.75, size = 0.75) +
geom_hline(yintercept = -log10(0.05), color = "red", linetype = "dashed") +
geom_vline(xintercept = -1, color = "red", linetype = "dashed") +
geom_vline(xintercept = 1, color = "red", linetype = "dashed") +
xlab("log2 fold change") +
ylab("-log10 p-value") +
theme(plot.title = element_text(hjust = 0.5)) +
scale_color_manual(values = c("magenta", "limegreen", "gray3")) +
ggtitle(paste0(c3, " in ", gsub("_", " ", sel_comp)))
Count plots are created for the top 5 unfiltered DE genes for each comparison.
res_df1 %>%
dplyr::filter(!duplicated(genes)) %>%
dplyr::slice(1:5) %>%
dplyr::select(genes, tidyr::contains("P19")) %>%
tidyr::gather(key = "sample", value = "count", -genes) %>%
dplyr::left_join(meta_data, by = "sample") %>%
dplyr::mutate(genes = factor(genes, levels = unique(genes))) %>%
ggplot(aes(x = sample, y = log10(count), color = condition)) +
geom_point() +
theme(axis.text.x = element_blank()) +
ggtitle(paste0("Gene counts of top 5 genes for ",
c1, " in ", gsub("_", " ", sel_comp))) +
facet_grid(~genes)
res_df2 %>%
dplyr::filter(!duplicated(genes)) %>%
dplyr::slice(1:5) %>%
dplyr::select(genes, tidyr::contains("P19")) %>%
tidyr::gather(key = "sample", value = "count", -genes) %>%
dplyr::left_join(meta_data, by = "sample") %>%
dplyr::mutate(genes = factor(genes, levels = unique(genes))) %>%
ggplot(aes(x = sample, y = log10(count), color = condition)) +
geom_point() +
theme(axis.text.x = element_blank()) +
ggtitle(paste0("Gene counts of top 5 genes for ",
c2, " in ", gsub("_", " ", sel_comp))) +
facet_grid(~genes)
res_df3 %>%
dplyr::filter(!duplicated(genes)) %>%
dplyr::slice(1:5) %>%
dplyr::select(genes, tidyr::contains("P19")) %>%
tidyr::gather(key = "sample", value = "count", -genes) %>%
dplyr::left_join(meta_data, by = "sample") %>%
dplyr::mutate(genes = factor(genes, levels = unique(genes))) %>%
ggplot(aes(x = sample, y = log10(count), color = group)) +
geom_point() +
theme(axis.text.x = element_blank()) +
ggtitle(paste0("Gene counts of top 5 genes for ",
c3, " in ", gsub("_", " ", sel_comp))) +
facet_grid(~genes)
Count plots are created for the top 5 DE genes for each comparison,
filtered by perc_non_zero_de > 0.66, or in
other words, all samples contain 66.6% “complete cases” in terms of
expression for each gene.
res_df1 %>%
dplyr::filter(!duplicated(genes),
perc_non_zero_de > 0.66) %>%
dplyr::slice(1:5) %>%
dplyr::select(genes, tidyr::contains("P19")) %>%
tidyr::gather(key = "sample", value = "count", -genes) %>%
dplyr::left_join(meta_data, by = "sample") %>%
dplyr::mutate(genes = factor(genes, levels = unique(genes))) %>%
ggplot(aes(x = sample, y = log10(count), color = condition)) +
geom_point() +
theme(axis.text.x = element_blank()) +
ggtitle(paste0("Gene counts of top 5 genes for ",
c1, " in ", gsub("_", " ", sel_comp))) +
facet_grid(~genes)
res_df2 %>%
dplyr::filter(!duplicated(genes),
perc_non_zero_de > 0.66) %>%
dplyr::slice(1:5) %>%
dplyr::select(genes, tidyr::contains("P19")) %>%
tidyr::gather(key = "sample", value = "count", -genes) %>%
dplyr::left_join(meta_data, by = "sample") %>%
dplyr::mutate(genes = factor(genes, levels = unique(genes))) %>%
ggplot(aes(x = sample, y = log10(count), color = condition)) +
geom_point() +
theme(axis.text.x = element_blank()) +
ggtitle(paste0("Gene counts of top 5 genes for ",
c2, " in ", gsub("_", " ", sel_comp))) +
facet_grid(~genes)
res_df3 %>%
dplyr::filter(!duplicated(genes),
perc_non_zero_de > 0.66) %>%
dplyr::slice(1:5) %>%
dplyr::select(genes, tidyr::contains("P19")) %>%
tidyr::gather(key = "sample", value = "count", -genes) %>%
dplyr::left_join(meta_data, by = "sample") %>%
dplyr::mutate(genes = factor(genes, levels = unique(genes))) %>%
ggplot(aes(x = sample, y = log10(count), color = group)) +
geom_point() +
theme(axis.text.x = element_blank()) +
ggtitle(paste0("Gene counts of top 5 genes for ",
c3, " in ", gsub("_", " ", sel_comp))) +
facet_grid(~genes)
Below are three html tables displaying the results the DE comparisons. These tables are interactive and be queried for specific proteins or sorted by column.
Note: these are the results filtered for genes with
perc_non_zero_de > 0.66. To view the full unfiltered DE
results, please refer to the Excel spreadsheets.
Some other notes on how DESeq2
calculates p-values set to NA. Some values in the results table can be
set to NA for one of the following reasons: * If within a
row, all samples have zero counts, the baseMean column will be zero, and
the log2 fold change estimates, p-value and adjusted p-value will all be
set to NA. * If a row contains a sample with an extreme
count outlier then the p-value and adjusted p-value will be set to
NA. These outlier counts are detected by Cook’s distance. *
If a row is filtered by automatic independent filtering, for having a
low mean normalized count, then only the adjusted p-value will be set to
NA.
res_df1 %>%
dplyr::filter(perc_non_zero_de > 0.66) %>%
dplyr::select(-c(contains("_id"))) %>%
dplyr::mutate_at(vars(non_zero_de:stat), round, 3) %>%
dplyr::mutate_at(vars(matches("rep")), round, 3) %>%
DT::datatable()
res_df2 %>%
dplyr::filter(perc_non_zero_de > 0.66) %>%
dplyr::select(-c(contains("_id"))) %>%
dplyr::mutate_at(vars(non_zero_de:stat), round, 3) %>%
dplyr::mutate_at(vars(matches("rep")), round, 3) %>%
DT::datatable()
res_df3 %>%
dplyr::filter(perc_non_zero_de > 0.66) %>%
dplyr::select(-c(contains("_id"))) %>%
dplyr::mutate_at(vars(non_zero_de:stat), round, 3) %>%
dplyr::mutate_at(vars(matches("rep")), round, 3) %>%
DT::datatable()
Results are saved as Excel .xslx and .rds files in two versions of each DEG list:
These Excel files have the following columns:
genes = Hugo gene symbol (official gene name)entrez_id = Entrez gene IDensembl_id = Ensembl gene IDnon_zero_de = number of non-zero values across gene
rowsperc_non_zero_de = percent non-zero values across gene
rows out of all samplesbaseMean = DESeq2 average mean count per genelog2FoldChange = log2 normalized fold change between
two DE conditionslfcSE = standard Normal distribution to generate a
two-tailed p-valuestat = the difference in deviance between the reduced
model and the full model, which is compared to a chi-squared
distribution to generate a pvaluepvalue = p-valuepadj = Benjamini-Hochberg FDR p-adjusted valueC9_upreg_marker = upregulated in Selvaraj, B.T,
Livesey, M. R et al., 2018 mutant C9 motor neuronsC9_downreg_marker = downregulated in Selvaraj, B.T,
Livesey, M. R et al., 2018 mutant C9 motor neuronsNote: Cook’s Distance could not be calculated because we have condition groups of less than 4 samples.
## populate list of DE results
de_list <- list(res_df1,
res_df1 %>% dplyr::filter(perc_non_zero_de > 0.66),
res_df2,
res_df2 %>% dplyr::filter(perc_non_zero_de > 0.66),
res_df3,
res_df3 %>% dplyr::filter(perc_non_zero_de > 0.66))
## name list headers to become sheet names
names(de_list) <- c(paste0(c1, " in ", sel_comp, " unfiltered DEGs"),
paste0(c1, " in ", sel_comp, " filtered DEGs"),
paste0(c2, " in ", sel_comp, " unfiltered DEGs"),
paste0(c2, " in ", sel_comp, " filtered DEGs"),
paste0(c3, " in ", sel_comp, " unfiltered DEGs"),
paste0(c3, " in ", sel_comp, " filtered DEGs"))
## save as Excel and RDS of DEG lists
saveRDS(object = de_list, file = paste0(
"./data/", proj, "_DEG_res_v", ann_version, "_", Sys.Date(), ".rds"))
writexl::write_xlsx(x = de_list, path = paste0(
"./data/", proj, "_DEG_res_v", ann_version, "_", Sys.Date(), ".xlsx"))
Session Info
sessionInfo()
#> R version 4.3.3 (2024-02-29)
#> Platform: aarch64-apple-darwin20 (64-bit)
#> Running under: macOS Sonoma 14.4.1
#>
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
#>
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#>
#> time zone: Europe/Brussels
#> tzcode source: internal
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] DT_0.33 biomaRt_2.58.2
#> [3] gridExtra_2.3 janitor_2.2.0
#> [5] writexl_1.5.0 readxl_1.4.3
#> [7] readr_2.1.5 pheatmap_1.0.12
#> [9] RColorBrewer_1.1-3 DESeq2_1.42.1
#> [11] SummarizedExperiment_1.32.0 Biobase_2.62.0
#> [13] MatrixGenerics_1.14.0 matrixStats_1.2.0
#> [15] GenomicRanges_1.54.1 GenomeInfoDb_1.38.8
#> [17] IRanges_2.36.0 S4Vectors_0.40.2
#> [19] BiocGenerics_0.48.1 viridis_0.6.5
#> [21] viridisLite_0.4.2 ggrepel_0.9.5
#> [23] ggplot2_3.5.0 tibble_3.2.1
#> [25] tidyr_1.3.1 dplyr_1.1.4
#>
#> loaded via a namespace (and not attached):
#> [1] DBI_1.2.2 bitops_1.0-7 rlang_1.1.3
#> [4] magrittr_2.0.3 snakecase_0.11.1 compiler_4.3.3
#> [7] RSQLite_2.3.6 mgcv_1.9-1 png_0.1-8
#> [10] vctrs_0.6.5 stringr_1.5.1 pkgconfig_2.0.3
#> [13] crayon_1.5.2 fastmap_1.1.1 dbplyr_2.5.0
#> [16] XVector_0.42.0 labeling_0.4.3 utf8_1.2.4
#> [19] rmarkdown_2.26 tzdb_0.4.0 purrr_1.0.2
#> [22] bit_4.0.5 xfun_0.43 zlibbioc_1.48.2
#> [25] cachem_1.0.8 jsonlite_1.8.8 progress_1.2.3
#> [28] blob_1.2.4 highr_0.10 DelayedArray_0.28.0
#> [31] BiocParallel_1.36.0 parallel_4.3.3 prettyunits_1.2.0
#> [34] R6_2.5.1 bslib_0.7.0 stringi_1.8.3
#> [37] lubridate_1.9.3 jquerylib_0.1.4 cellranger_1.1.0
#> [40] Rcpp_1.0.12 knitr_1.45 splines_4.3.3
#> [43] Matrix_1.6-5 timechange_0.3.0 tidyselect_1.2.1
#> [46] rstudioapi_0.16.0 abind_1.4-5 yaml_2.3.8
#> [49] codetools_0.2-20 curl_5.2.1 lattice_0.22-6
#> [52] withr_3.0.0 KEGGREST_1.42.0 evaluate_0.23
#> [55] BiocFileCache_2.10.2 xml2_1.3.6 Biostrings_2.70.3
#> [58] filelock_1.0.3 pillar_1.9.0 generics_0.1.3
#> [61] RCurl_1.98-1.14 hms_1.1.3 munsell_0.5.1
#> [64] scales_1.3.0 glue_1.7.0 tools_4.3.3
#> [67] locfit_1.5-9.9 XML_3.99-0.16.1 grid_4.3.3
#> [70] crosstalk_1.2.1 AnnotationDbi_1.64.1 colorspace_2.1-0
#> [73] nlme_3.1-164 GenomeInfoDbData_1.2.11 cli_3.6.2
#> [76] rappdirs_0.3.3 fansi_1.0.6 S4Arrays_1.2.1
#> [79] gtable_0.3.4 sass_0.4.9 digest_0.6.35
#> [82] SparseArray_1.2.4 farver_2.1.1 htmlwidgets_1.6.4
#> [85] memoise_2.0.1 htmltools_0.5.8.1 lifecycle_1.0.4
#> [88] httr_1.4.7 bit64_4.0.5
References:
Love, M.I., Huber, W., Anders, S. (2014) “Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2.” Genome Biology, 15:550. 10.1186/s13059-014-0550-8
Anders, Simon, and Wolfgang Huber. 2010. “Differential Expression Analysis for Sequence Count Data.” Genome Biology 11:R106. http://genomebiology.com/2010/11/10/R106.
Ritchie ME, Phipson B, Wu D, Hu Y, Law CW, Shi W, Smyth GK (2015). “limma powers differential expression analyses for RNA-sequencing and microarray studies.” Nucleic Acids Research, 43(7), e47. doi: 10.1093/nar/gkv007.